Runtime Prediction for Scale-Out Data Analytics
Abstract
Many analytics applications generate mixed workloads, i.e., workloads composed of analytical tasks with different processing characteristics, including data pre-processing, SQL, and iterative machine learning algorithms. Examples of such mixed workloads can be found in web data analysis, social media analysis, and graph analytics, where they are executed repeatedly on large input datasets (e.g., "Find the average user time spent on the top 10 most popular web pages on the UK domain web graph"). Scale-out processing engines satisfy the needs of these applications by distributing the data and the processing task among multiple workers, which are first reserved and then used to execute the task in parallel on a cluster of machines. Both finding a resource allocation that completes a workload within a given time constraint and optimizing cluster resource allocations across multiple analytical workloads require estimating the runtime of a workload before it is executed. Predicting the runtime of analytical workloads is a challenging problem because runtime depends on a large number of factors that are hard to model prior to execution. These factors can be summarized as the workload characteristics (data statistics and processing costs), the execution configuration (deployment, resource allocation, and software settings), and the cost model that captures the interplay among all of these parameters. While conventional cost models proposed in the context of query optimization can assess the relative order among alternative SQL query plans, they are not designed to estimate absolute runtime. Additionally, conventional models are ill-equipped to estimate the runtime of iterative analytics, which are executed repeatedly until convergence, and of user-defined data pre-processing operators, which are not "owned" by the underlying data management system.
This thesis demonstrates that the runtime of data analytics can be predicted accurately by breaking the analytical tasks into multiple processing phases, collecting key input features during a reference execution on a sample of the dataset, and then using those features to build per-phase cost models. We develop prediction models for three categories of data analytics produced by social media applications: iterative machine learning, data pre-processing, and reporting SQL. The prediction framework for iterative analytics, PREDIcT, addresses the challenging problem of estimating the number of iterations and the per-iteration runtime for a class of iterative machine learning algorithms that are run repeatedly until convergence. The hybrid prediction models we develop for data pre-processing tasks and for reporting SQL combine the benefits of analytical modeling with those of machine learning-based models. Through a
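As an illustration of the per-phase idea described in the abstract, the following is a minimal sketch and not the thesis's actual models. It assumes, purely for illustration, that each phase's runtime is linear in the number of input records, and it fits one ordinary least-squares line per phase from a few sample runs. The phase names (`read`, `compute`), the `PhaseModel` class, the helper functions, and all measurements are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhaseModel:
    """Hypothetical linear cost model for one processing phase."""
    intercept: float  # fixed per-phase overhead, in seconds
    slope: float      # additional seconds per input record

    def predict(self, records: float) -> float:
        return self.intercept + self.slope * records


def fit_phase(samples):
    """Fit runtime = intercept + slope * records by ordinary least squares.

    `samples` is a list of (records, seconds) pairs from sample runs.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need sample runs with at least two distinct input sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return PhaseModel(intercept=mean_y - slope * mean_x, slope=slope)


def predict_runtime(models, records_per_phase):
    """Sum the per-phase predictions to estimate the total runtime."""
    return sum(models[phase].predict(records_per_phase[phase]) for phase in models)


# Made-up measurements from sample runs: (input records, seconds) per phase.
sample_runs = {
    "read": [(1000, 1.5), (2000, 2.5), (4000, 4.5)],
    "compute": [(1000, 3.0), (2000, 5.0), (4000, 9.0)],
}

models = {phase: fit_phase(obs) for phase, obs in sample_runs.items()}

# Extrapolate from the sample runs to a hypothetical full dataset.
estimate = predict_runtime(models, {"read": 100_000, "compute": 100_000})
print(f"estimated runtime: {estimate:.1f} s")
```

A linear model per phase is the simplest possible choice. The abstract indicates that the thesis uses richer input features and, for some workloads, hybrid analytical and machine learning-based models, so this sketch only conveys the overall structure of fitting on sample runs and extrapolating per phase.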
Similar resources
PREDIcT: Towards Predicting the Runtime of Large Scale Iterative Analytics
Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Opt...
Integrating Modeling Languages and Web Logs for Enhanced User Behavior Analytics
While basic Web analytics tools are widespread and provide statistics about Web site navigation, no approaches exist for merging such statistics with information about the Web application structure, content and semantics. We demonstrate the advantages of combining Web application models with runtime navigation logs, with the aim of deepening the understanding of user behaviour. We propose a ...
Towards Predicting the Runtime of Iterative Analytics with PREDIcT
Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Opt...
FlexAnalytics: A Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement
Increasingly larger scale applications are generating an unprecedented amount of data. However, the increasing gap between computation and I/O capacity on High End Computing machines creates a severe bottleneck for data analysis. Instead of moving data from its source to the output storage, in-situ analytics processes output data while simulations are running...
Towards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System
Geo-distributed data analytics are increasingly common to derive useful information in large organisations. Naive extension of existing cluster-scale data analytics systems to the scale of geo-distributed data centers faces unique challenges including WAN bandwidth limits, regulatory constraints, changeable/unreliable runtime environment, and monetary costs. Our goal in this work is to develop ...
Publication date: 2015